As of mid-2018, the Google Maps Platform requires a registered API key, so the gmaps example shown on the lecture slides is no longer as simple to work through. How might we do something similar on a non-local map? R comes with a distance matrix for a set of European cities:
labels(eurodist)
## [1] "Athens" "Barcelona" "Brussels" "Calais"
## [5] "Cherbourg" "Cologne" "Copenhagen" "Geneva"
## [9] "Gibraltar" "Hamburg" "Hook of Holland" "Lisbon"
## [13] "Lyons" "Madrid" "Marseilles" "Milan"
## [17] "Munich" "Paris" "Rome" "Stockholm"
## [21] "Vienna"
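eurodist is a "dist" object holding road distances (in km) between these cities; converting it to a matrix makes individual lookups easy:

```r
# eurodist ships with base R's datasets package
m <- as.matrix(eurodist)
dim(m)                    # 21 x 21, one row and column per city
m["Athens", "Stockholm"]  # road distance between a pair of cities
```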
If we run MDS on this distance matrix, we get the following results:
# store the MDS coordinates once instead of recomputing them per call
mds <- cmdscale(eurodist)
plot(mds[,1], mds[,2], type = "n", xlab = "", ylab = "", asp = 1, axes = FALSE, main = "MDS Results on eurodist")
text(mds[,1], mds[,2], rownames(mds), cex = 0.6)
Again, something’s not right. Specifically, we would expect Stockholm to be the most northern city of the bunch and Athens to be the most southern. Let’s simply multiply the second column by -1.
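Why is flipping a column legitimate? Classical MDS only recovers the configuration up to rotation and reflection, and a quick check confirms that the flip preserves every pairwise distance:

```r
mds <- cmdscale(eurodist)
flipped <- mds %*% diag(c(1, -1))  # negate the second coordinate
# the pairwise distances are unchanged by the reflection
max(abs(dist(flipped) - dist(mds)))
```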
mds <- cmdscale(eurodist)
plot(mds[,1], -mds[,2], type = "n", xlab = "", ylab = "", asp = 1, axes = FALSE, main = "MDS Results on eurodist")
text(mds[,1], -mds[,2], rownames(mds), cex = 0.6)
This is more promising; you can compare to the europecircled.png file on GitHub to see how well MDS has done.
Let’s run FA on the car93.csv data set that we looked at a number of weeks ago with PCA…
card <- read.csv("~/Downloads/car93.csv", stringsAsFactors = FALSE)
Let’s fit a model with two factors:
facar <- factanal(card[,-c(1:3)], 2, scores="regression")
facar
##
## Call:
## factanal(x = card[, -c(1:3)], factors = 2, scores = "regression")
##
## Uniquenesses:
## Price MPG.city MPG.highway EngineSize
## 0.328 0.289 0.358 0.107
## Horsepower RPM Rev.per.mile Fuel.tank.capacity
## 0.107 0.456 0.310 0.174
## Length Wheelbase Width Turn.circle
## 0.115 0.141 0.143 0.287
## Rear.seat.room Luggage.room Weight
## 0.625 0.360 0.018
##
## Loadings:
## Factor1 Factor2
## Price 0.812 0.113
## MPG.city -0.745 -0.395
## MPG.highway -0.760 -0.255
## EngineSize 0.684 0.652
## Horsepower 0.932 0.155
## RPM -0.735
## Rev.per.mile -0.522 -0.646
## Fuel.tank.capacity 0.796 0.437
## Length 0.577 0.742
## Wheelbase 0.597 0.709
## Width 0.586 0.717
## Turn.circle 0.510 0.673
## Rear.seat.room 0.220 0.571
## Luggage.room 0.284 0.748
## Weight 0.834 0.536
##
## Factor1 Factor2
## SS loadings 6.155 5.025
## Proportion Var 0.410 0.335
## Cumulative Var 0.410 0.745
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 318.8 on 76 degrees of freedom.
## The p-value is 1.74e-31
The print method leaves blank any loadings that fall below a threshold (the cutoff argument of the print method, which defaults to 0.1).
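If you want to see every loading, you can pass a cutoff of 0 to the print method for loadings. A quick illustration on a built-in data set (mtcars), since car93.csv may not be on your machine:

```r
# fit a two-factor model on a built-in numeric data set
fa <- factanal(mtcars, factors = 2)
print(fa$loadings, cutoff = 0)  # show all loadings, however small
```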
We see approximately the same amount of variation described by the two factors (74.5%) as was described by the two components (78%), and the loadings on the first factor are fairly similar to those on the first component. That said, the second factor is a little less intuitive than the second component was with PCA.
Note the provided p-value (highly significant). The null hypothesis of this test is that 2 factors are sufficient for fitting this data, so a significant p-value is evidence that 2 factors are NOT sufficient.
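The chi-square statistic and p-value can also be pulled out of the fitted object programmatically, via the STATISTIC and PVAL components. A sketch on mtcars, since car93.csv may not be available:

```r
fa2 <- factanal(mtcars, factors = 2)
fa2$STATISTIC  # the chi-square statistic
fa2$PVAL       # p-value for H0: 2 factors are sufficient
```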
In any case, let’s take a look at the scores on the two factors
plot(facar$scores)
This looks fairly similar to the PCA results. Let’s see what happens when clustering
faclust <- hclust(dist(facar$scores))
plot(faclust)
Hmm, this looks more like three groups
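As a reminder of the mechanics, cutree slices the dendrogram at a height that yields k groups and returns one group label per observation. A small, self-contained illustration on the built-in USArrests data:

```r
# hierarchical clustering on scaled data, then cut into 3 groups
hc <- hclust(dist(scale(USArrests)))
groups <- cutree(hc, k = 3)  # one label (1, 2, or 3) per state
table(groups)                # group sizes
```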
table(card[,3], cutree(faclust,3))
##
## 1 2 3
## Compact 9 1 6
## Large 0 0 11
## Midsize 0 10 12
## Small 21 0 0
## Sporty 9 1 2
In this case, we still have a mostly-small-car group and a mostly-large-car group, but also a group made up of half of the midsize cars. Which raises the question: why those midsize cars and not the others?
cgroups <- cutree(faclust,3)
card[cgroups==2 & card$Type=="Midsize", 1:3]
## Manufacturer Model Type
## 2 Acura Legend Midsize
## 4 Audi 100 Midsize
## 5 BMW 535i Midsize
## 11 Cadillac Seville Midsize
## 43 Infiniti Q45 Midsize
## 44 Lexus ES300 Midsize
## 45 Lexus SC300 Midsize
## 52 Mercedes-Benz 300E Midsize
## 56 Mitsubishi Diamante Midsize
## 82 Volvo 850 Midsize
card[cgroups==3 & card$Type=="Midsize", 1:3]
## Manufacturer Model Type
## 6 Buick Century Midsize
## 9 Buick Riviera Midsize
## 15 Chevrolet Lumina Midsize
## 23 Dodge Dynasty Midsize
## 32 Ford Taurus Midsize
## 42 Hyundai Sonata Midsize
## 46 Lincoln Continental Midsize
## 54 Mercury Cougar Midsize
## 59 Nissan Maxima Midsize
## 61 Oldsmobile Cutlass_Ciera Midsize
## 67 Pontiac Grand_Prix Midsize
## 77 Toyota Camry Midsize
As you can see, that second group appears to correspond to the luxury midsize vehicles. So the groups discovered are approximately ‘smaller cars’, ‘luxury midsize cars’, and ‘larger cars’.
Find the pain.rda file on GitHub. It’s the faces data that you saw in lecture.
load("~/Downloads/pain.rda")
I will run you through all of the steps used to generate the outputs you saw in lecture. First, note the structure of the data
dim(pain)
## [1] 241 181 84
With a bit of knowledge of the data, it’s clear this is an array where the first two dimensions make up the pixels and the third dimension indexes the different pictures. Let’s try out the image function!
image(pain[,,1])
Two problems: it’s obviously sideways, and I’m also not a big fan of the heatmap colour scheme. Let’s fix both…
image(t(pain[,,1]), col=gray((0:32)/32))
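As an aside on the col argument: gray() maps numbers in [0, 1] to grey hex colours, so gray((0:32)/32) gives 33 evenly spaced levels from black to white.

```r
gray(c(0, 0.5, 1))        # black, mid-grey, white
length(gray((0:32)/32))   # 33 grey levels
```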
Much better. The following code can cycle you through all the pictures (three per second). I recommend setting eval=FALSE for this chunk in the markdown document, and you’d need ffmpeg installed on your computer to render the animation in the knitted output (probably not worth troubleshooting across operating systems and whatnot).
for (i in 1:84) {
  image(t(pain[,,i]), col = gray((0:32)/32))
  Sys.sleep(1/3)  # pause so the frames play at roughly 3 per second
}
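If you do want a movie file rather than an on-screen flip-book, one common workflow is to write each frame to a PNG and stitch them together with ffmpeg. A hedged sketch, using a small random array as a stand-in for pain so it is self-contained; the file names and ffmpeg flags are illustrative, not from the original:

```r
frames <- array(runif(20 * 20 * 5), dim = c(20, 20, 5))  # stand-in data
out <- tempdir()
for (i in seq_len(dim(frames)[3])) {
  png(file.path(out, sprintf("frame%03d.png", i)))  # one PNG per frame
  image(t(frames[, , i]), col = gray((0:32)/32), axes = FALSE)
  dev.off()
}
# then, at the shell (requires ffmpeg):
# ffmpeg -framerate 3 -i frame%03d.png pain.mp4
```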